Abstract

We consider systems of interacting reinforcement learning (RL) al-
gorithms that do not “work at cross purposes”, in that their col-
lective behavior maximizes a global utility function. We call such
systems Collective INtelligences (COINs). We present the the- 
ory of designing COINs. Then we present experiments validating 
that theory in the context of two distributed control problems: We 
show that COINs perform near-optimally in a difficult variant of 
Arthur’s bar problem [Arthur] (and in particular avoid the tragedy 
of the commons for that problem), and we also illustrate optimal 
performance in the master-slave problem. 


1 INTRODUCTION 

Collective INtelligences (COINs) are (usually) very large, sparsely connected re- 
current neural nets, whose “neurons” are themselves reinforcement learning (RL) 
algorithms. The distinguishing feature of a COIN is that its dynamics involves no 
centralized control, but only the collective effects of the individual neurons each 
modifying their behavior via their (local) RL algorithms. This restriction holds 
even though the goal of the COIN concerns the system’s global behavior. 

One naturally occurring COIN is a human economy, where the “neurons” consist of
individual humans trying to maximize their reward, and the “goal”, for example,
can be viewed as having the overall system achieve high gross domestic product.
The robustness and power of naturally occurring COINs suggests it would be useful
to construct Artificial COINs (ACOINs), for example as controllers of distributed 
systems. This paper presents a preliminary investigation of this idea in the context 
of distributed control. 

In designing an ACOIN, we start with a global utility function concerning global 
behavior. Our task is to initialize and then update the neurons’ utility functions so 
that as the neurons improve their utilities, global utility also improves. (We may 
also wish to update the local topology of the net to the same end.) In particular, 
we need to ensure that the neurons do not “frustrate” each other as they attempt 
to increase their utilities. All of this must be done without centralized control. 

We refer to the RL algorithms at each neuron as microlearning, and to the updating of
the ACOIN as macrolearning. For robustness and breadth of applicability,
we assume essentially no knowledge concerning the dynamics of the full system, i.e., 
the macrolearning and/or microlearning must “learn” that dynamics, implicitly or 
otherwise. This rules out any approach that models the full system. 

The problem of designing an ACOIN has never previously been addressed in full — 
hence the need for introducing new concepts like those described below. Nonethe- 
less, this problem is related to previous work in many fields: distributed artificial 
intelligence, multi-agent systems, computational ecologies, adaptive control, com- 
putational economics, game theory (and in particular stochastic game theory, evo- 
lutionary game theory, and game-theoretic mechanism design), voting and auction 
theory, computational markets, iterated function systems, active walker models, 
self-organized criticality, spin glasses, reinforcement learning, recurrent neural nets, 
Markov decision process theory, network theory, and ant-based optimization. 

In an associated paper [ngi], we design an ACOIN for distributed internet rout- 
ing, achieving performance better than that of all previous RL-based approaches 
to internet routing [Littman, OTHERS]. In this paper we instead concentrate on 
improving our understanding of COINs. We do this by using computer experiments 
to investigate the theory of COINs in the context of two abstract distributed con- 
trol problems: a more challenging variant of Arthur’s bar problem in economics 
[Arthur] and the master-slave problem. In the context of these problems we vali- 
date some of the predictions of the theory of COINs. In particular, we confirm the 
necessity of “constraint alignment” for having the use of “wonderful life” neuronal 
utility functions result in optimal global performance: When there is constraint- 
alignment, the two COINs achieve 99% of the optimal for the bar problem and 
100% of the optimal for the master-slave problem. In contrast, natural non-COIN- 
based utility functions encounter difficulties like tragedy of the commons for these 
two control problems, resulting in performance levels of at best 24.6% and 9% for 
the two problems investigated, respectively. 

The next section presents an overview of those aspects of the mathematics be- 
hind ACOINs that are investigated in this paper. The following section presents 
computer experiments concerning that mathematics in the context of distributed
control. The final section presents conclusions and summarizes future research di-
rections.

2 MATHEMATICS OF COINS 

The mathematical framework for COINs in general is quite extensive [Us]. This 
paper concentrates on four of the concepts from that framework: subworlds, factored 
systems, constraint-alignment, and the wonderful-life utility function. 

i) We consider the state of the system across a set of T + 1 consecutive timesteps,
which we write without loss of generality as t ∈ {0, ..., T}. Let ζ_{η,t} be the (Euclidean-
vector-valued) state of neuron η at time t. World utility is a function of the state
of all neurons across all t, which we write as G(ζ).

ii) To simplify the design of the ACOIN, we group neurons into subworlds, with all
neurons in a subworld sharing the same subworld utility. In the ACOINs considered
here, subworlds have no overlap, and a neuron’s microlearner receives the utility of
that neuron’s subworld as its reward signal. We indicate the state of neuron η in
subworld ω at time t by ζ_{ω,η,t}, and by g_ω(ζ) we indicate the subworld utility of
subworld ω.

iii) One of the primary desiderata in designing an ACOIN is to have it be subworld-
factored. This means that a change at time 0 to the state of the neurons in subworld
ω results in an increased value for g_ω(ζ) iff it results in an increased value for G(ζ).

For a subworld-factored system, ω’s increasing its own utility does not have “side-
effects” on the rest of the system that decrease world utility. For these systems, the
separate neurons pursuing their separate goals do not frustrate each other, as far as 
world utility is concerned. In particular, consider a Nash equilibrium of the system, 
i.e., a state in which no neuron can increase its utility by changing its action, given 
the actions of the other neurons [Nash, GT]. The following is proven in [us]: 

Theorem 1: If a system is subworld-factored, then any state at time 0 of the system
that constitutes a Nash equilibrium for the subworld utilities is a critical point of
world utility. Conversely, a maximum of world utility is such a Nash equilibrium.

In other words, for a subworld-factored system, although it may be maximized at 
other points as well, world utility is assuredly maximized when each of the neurons’
reinforcement learning algorithms is performing as well as it can, given the
others’ behavior. So we have the correct behavior if and when the learners reach
an equilibrium. There can be no tragedy of the commons [tote], etc. 

iv) A (perfectly) constraint-aligned system is one in which any change to the state
of the neurons in subworld ω at time 0 will have no effects on the states of neurons
outside of ω at times later than 0.

v) Let Zero_ω(ζ) be defined as the vector ζ modified by setting the values of all
neurons in subworld ω across all time to 0. The wonderful life subworld utility
(WLU) is g_ω(ζ) = G(ζ) - G(Zero_ω(ζ)).

Intuitively, the WLU is analogous to the change in world utility that would have
arisen if subworld ω had never existed. (Hence its name - cf. the Frank Capra
movie.) Note though that no assumption is made that Zero_ω(ζ) is consistent with
the dynamics of the system. In particular, to evaluate the WLU we do not have
to infer how the system would have evolved if all neurons in ω were set to 0 at
time 0. So long as we know ζ (by observation, in the process of macrolearning that
occurs at time T), and so long as we know G (set in the problem specification of the
ACOIN), we can evaluate the WLU. This is true even if we know nothing of the dynamics
of the system. This dynamics-independence is a major strength of the WLU.
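
The dynamics-independence of the WLU can be illustrated with a minimal sketch. It
assumes, purely for illustration, that each neuron's state is a scalar, that the
observed history is stored as one list of states per timestep, and that world
utility is the hypothetical function example_G below; none of these choices comes
from the experiments reported here.

```python
def zero_subworld(zeta, subworld_of, omega):
    """Return a copy of zeta with every neuron in subworld omega set to 0 at all times.

    zeta[t][i] is the (scalar, by assumption) state of neuron i at time t;
    subworld_of[i] is the subworld label of neuron i.
    """
    return [
        [0.0 if subworld_of[i] == omega else state for i, state in enumerate(states)]
        for states in zeta
    ]


def wonderful_life_utility(G, zeta, subworld_of, omega):
    """WLU of subworld omega: G(zeta) - G(Zero_omega(zeta)).

    Only the observed history zeta and the function G are needed;
    no model of the system dynamics is consulted.
    """
    return G(zeta) - G(zero_subworld(zeta, subworld_of, omega))


def example_G(zeta):
    """Hypothetical world utility: sum over time of the squared total activity."""
    return sum(sum(states) ** 2 for states in zeta)


# Two timesteps, three neurons; neurons 0 and 1 form subworld 0, neuron 2 forms subworld 1.
zeta = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
subworld_of = [0, 0, 1]
wlu_0 = wonderful_life_utility(example_G, zeta, subworld_of, 0)
wlu_1 = wonderful_life_utility(example_G, zeta, subworld_of, 1)
```

Note that the clamped history is built by direct substitution, so it need not be a
history the real system could ever produce.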

In addition to assuring the correct equilibrium behavior, there exist many other 
theoretical advantages to having a system be subworld-factored [us]. In this paper 
we use computer experiments to establish the applicability of these advantages to 
PROBLEMS. In particular, the following theorem, proved in [us], serves as the basis 
for the investigation of this paper: 

Theorem 2: A constraint-aligned system with wonderful life subworld utilities is 
subworld-factored. 

3 EXPERIMENTS 

Two sets of experiments were conducted to investigate the theory behind ACOINs. 
The first was a variant of the bar-attendance model [Arthur, 94]; the second was a
more complicated model involving master/slave relationships. 

Arthur [Arthur, 94] proposed a ‘bar-attendance’ model of economic market behavior 
which contained heterogeneous agents. This model has proven to be relatively easy 
to optimize [REFERENCE!!!]. To make the problem more challenging, the following 
variant was investigated. 

The problem consists of N patrons that each pick one of 7 days to attend the bar
on the following week. The bar has a fixed capacity. Each patron, via reinforcement
learning, must predict which night to attend on the following week based upon past
rewards. The global objective was to optimize G(x) = x_k exp(-x_k / c)
given the capacity constraint c, where x_k is the attendance on day k. Three types
of rewards were explored: 1. Tragedy-of-the-Commons (TOC), 2. Global, and 3.
Wonderful Life Utility.

The Tragedy-of-the-Commons reward [Crowe, 69] used a local utility of g(x) = G(x)/x.
The Global reward used g(x) = G(x). With this function the system is subworld-
factored. The wonderful life utility used g_ω(x) = G(x) - G(Zero_ω(x)). With this
function the system was close to constraint-aligned and close to subworld-factored.
The days in the week were not always treated equally. In one case only the middle 
day of the week was used in the utility calculations. In the other case all days were 
treated equally. The convergence rate of the three utility functions is compared. 
The resulting bar attendances are also compared. 
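
The three reward types can be sketched as follows. This is an illustrative sketch
under stated assumptions, not the code used in the experiments: it assumes that
world utility sums the per-day term x_k exp(-x_k / c) over all days, that each
patron forms its own subworld, and that "zeroing" a patron means removing that
patron from the attendance count. The function names are hypothetical.

```python
import math


def day_utility(x_k, c):
    """Per-day term x_k * exp(-x_k / c) for attendance x_k and capacity c."""
    return x_k * math.exp(-x_k / c)


def world_utility(attendance, c):
    """ASSUMPTION: world utility is the sum of the per-day terms over all days."""
    return sum(day_utility(x_k, c) for x_k in attendance)


def toc_reward(attendance, day, c):
    """Tragedy-of-the-Commons reward: the chosen day's utility shared among its attendees."""
    x_k = attendance[day]  # at least 1, since the rewarded patron attends on this day
    return day_utility(x_k, c) / x_k


def global_reward(attendance, day, c):
    """Global reward: every patron receives the world utility itself."""
    return world_utility(attendance, c)


def wlu_reward(attendance, day, c):
    """Wonderful life reward for one patron (ASSUMPTION: one patron per subworld)."""
    without_patron = list(attendance)
    without_patron[day] -= 1  # attendance with this patron removed
    return world_utility(attendance, c) - world_utility(without_patron, c)


# A crowded middle day: attendance 6 against capacity c = 3.
attendance = [1, 6, 1]
c = 3.0
toc_crowded = toc_reward(attendance, 1, c)
wlu_crowded = wlu_reward(attendance, 1, c)
wlu_quiet = wlu_reward(attendance, 0, c)
```

In this example the TOC reward on the crowded day is still positive, while the
wonderful life reward is negative, so only the latter signals to the patron that
its presence lowers world utility.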

The intent of the Master-slave experiments was to demonstrate how a poor choice
of subworld memberships can adversely affect the Wonderful Life utility through the
non-alignment of constraints (see Theorem 2). The bar problem was modified to
resemble a society where the actions of one member have an effect on other members
of society. In the Master-slave experiments the capacity constraint on the bar was
removed. Patrons were assigned to belong to clubs, where each club had at least
one master, and each master had two slaves. Patrons were rewarded according to
how many other patrons from the same club were also present at the bar on the
same night. Slave patrons were required to take the same action as their master
patron.
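
The reward structure just described can be sketched as below. The data layout
(a list giving each patron's master, a dictionary of master choices) and the
function names are assumptions made for illustration only.

```python
from collections import Counter


def expand_actions(master_action, master_of):
    """Give every patron the night chosen by its master.

    master_of[i] is the index of patron i's master (a master is its own master);
    master_action maps each master's index to the night it chose.
    """
    return [master_action[m] for m in master_of]


def club_rewards(night_of, club_of):
    """Reward each patron with the number of OTHER patrons of its club present that night."""
    present = Counter(zip(club_of, night_of))
    return [present[(club, night)] - 1 for club, night in zip(club_of, night_of)]


# One club with two masters (patrons 0 and 3), each with two slaves.
master_of = [0, 0, 0, 3, 3, 3]
club_of = [0, 0, 0, 0, 0, 0]

split_nights = expand_actions({0: 2, 3: 5}, master_of)   # masters pick different nights
joint_nights = expand_actions({0: 2, 3: 2}, master_of)   # masters pick the same night
split_rewards = club_rewards(split_nights, club_of)
joint_rewards = club_rewards(joint_nights, club_of)
```

Because slaves copy their master, a single master's choice changes the reward of
every other member of the club, which is the coupling the experiments exploit.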


The same set of utility functions was investigated for these Master-slave experi-
ments. For the wonderful life utility, the closeness to constraint alignment depended
upon how the subworld memberships were chosen. The subworld membership that
resulted in the most constraint alignment was when members of the same club be-
longed to the same subworld. The worst case was when members of the same club
were all in different subworlds. The third type of subworld membership explored
was to assign the membership randomly.
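
The three membership schemes can be sketched as follows. The construction of the
worst case (the j-th member of every club goes to subworld j) and the number of
subworlds used for the random case are assumptions for illustration; the text
above does not specify them.

```python
import random


def assign_by_club(club_of):
    """Best case: every member of a club belongs to the same subworld."""
    return list(club_of)


def assign_all_different(club_of):
    """Worst case: members of the same club all land in different subworlds.

    ASSUMPTION: the j-th member encountered in each club is placed in subworld j.
    """
    seen = {}
    subworld_of = []
    for club in club_of:
        j = seen.get(club, 0)
        subworld_of.append(j)
        seen[club] = j + 1
    return subworld_of


def assign_randomly(club_of, n_subworlds, rng):
    """Random case: each patron is placed in a uniformly chosen subworld."""
    return [rng.randrange(n_subworlds) for _ in club_of]


def clubs_spanning_subworlds(club_of, subworld_of):
    """Count the clubs whose members are spread over more than one subworld."""
    spans = {}
    for club, subworld in zip(club_of, subworld_of):
        spans.setdefault(club, set()).add(subworld)
    return sum(1 for subworlds in spans.values() if len(subworlds) > 1)


club_of = [0, 0, 0, 1, 1, 1]
by_club = assign_by_club(club_of)
all_different = assign_all_different(club_of)
randomized = assign_randomly(club_of, n_subworlds=2, rng=random.Random(0))
```

Counting the clubs that span several subworlds gives a rough, hypothetical
indicator of how far a given assignment is from the club-aligned case.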

4 RESULTS 

Figures 1 and 2 show the convergence properties of the global performance in the 
bar problem for the three utility functions. In both cases the Wonderful Life utility 
converged faster to a better solution. The global utility would eventually achieve the 
same optimum but was very slow to converge. Figure 3 depicts the average deviation
from the optimal daily attendance solution for the three utilities. 

Table 1 shows the mean performance for the Master-slave problem for different
subworld memberships using the Wonderful Life utility, and the performance of
the TOC and Global utilities. The Global utility always had constraint alignment
and was therefore able to achieve the optimum. The Wonderful Life utility had perfect
alignment when the subworld memberships corresponded to the club membership.

The percentage of cases in which random subworld membership assignment was able to
surpass TOC in performance varied with the number of clubs, as shown in Table 2.

Table 1: Mean performance of Master-Slave Problem

Utility Type   Subworld Assignment   Mean Performance   No. Clubs
Wonderful      Club                    0                2
Wonderful      Worse                 -12                2
Wonderful      Random                 -8.68             2
TOC            -                      -1.16             2
Global         -                       6                2
Wonderful      Club                    6                3
Wonderful      Worse                 -18                3
Wonderful      Random                -15.58             3
TOC            -                      -1.75             3
Global         -                       0                3
Wonderful      Club                    0                4
Wonderful      Worse                 -24                4
Wonderful      Random                -21.76             4
TOC            -                      -2.34             4
Global         -                       0                4


Table 2: Percentage of random subworld membership better than Tragedy of the
Commons

No. of Clubs   Percent Better
2              21%
3              2%
4              0.08%



Figure 1. Average convergence when only middle day matters. 






Figure 2. Average convergence with all days equal. 



Figure 3. Average attendance deviation with all days equal (horizontal axis: days of week).




